Marco Huang(0201), Jingyun Li(0101)
COVID-19 is the disease caused by SARS-CoV-2, the coronavirus that emerged in December 2019. COVID-19 can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness. The coronavirus can be spread from person to person. It is diagnosed with a test.
Three years after the outbreak of the coronavirus, the growth in confirmed COVID-19 cases appears to be slowing, which makes this a good time to examine the pandemic as a whole. In part one, we look at COVID-19 in the United States overall and in several representative states, and discuss what the data show. In part two, we look at the relationship between confirmed cases and housing prices.
The data collection stage is very important. Without proper data to work with, no analysis can be done. Make sure to find credible and recent data to create accurate models and analysis.
In this project the Covid-19 data we used comes from Johns Hopkins University and is available at this link: https://github.com/CSSEGISandData/COVID-19
We used the following tools to collect and analyze this data: pandas, numpy, matplotlib, scikit-learn, seaborn, os, folium, and more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn
import warnings
import os
import folium
import scipy.stats as stats
from statsmodels.formula.api import ols as o
from sklearn import linear_model
import re
warnings.filterwarnings('ignore')
1.2.1 US overall
We first look at the overall confirmed cases and deaths in the US. Here we read the confirmed COVID-19 cases for the whole world from 1/22/20 to today. For this project, we focus on the United States.
Below is the global confirmed-case data. It includes every country, its latitude and longitude, and the cumulative number of confirmed cases on each day.
world_conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", sep=',')
world_conf.head()
We first reshape the world data frame from wide to long form, so that each row is one country on one date, and calculate the daily increase in confirmed cases. We then extract the rows for the US.
# Reshape from wide (one column per date) to long (one row per region and date).
us_conf = pd.melt(world_conf, ['Province/State','Country/Region', 'Lat', 'Long'], var_name="Date", value_name='conf_cases')
us_conf = us_conf.drop(columns=['Province/State', 'Lat', 'Long'])
us_conf = us_conf.rename(columns={'Country/Region': 'Country'})
us_conf["Date"] = pd.to_datetime(us_conf['Date'])
# Sum provinces so each country has one cumulative count per date.
us_conf = us_conf.groupby(['Country', 'Date']).sum()
# Difference within each country so one country's totals never leak into the next.
# The first date has no previous day, so its change is the cumulative count itself.
us_conf["conf_change"] = us_conf.groupby(level='Country')['conf_cases'].diff().fillna(us_conf['conf_cases'])
us_conf = us_conf.reset_index()
# Drop days where the cumulative count went down (a negative daily change).
us_conf = us_conf[us_conf["conf_change"] >= 0]
us_conf = us_conf[us_conf["Country"]=="US"]
us_conf = us_conf.set_index("Date")
us_conf = us_conf.drop(columns=['Country'])
us_conf.head()
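To make the reshaping and differencing steps concrete, here is a minimal sketch on a made-up two-country table. The country names "A" and "B" and all counts are invented for illustration and are not JHU data; the column layout is assumed to mirror the file read above.

```python
import pandas as pd

# Toy wide table in the same layout as the JHU file; all values are invented.
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [1, 10],
    "1/23/20": [3, 10],
    "1/24/20": [6, 15],
})

# Wide -> long: one row per (country, date).
long_df = wide.melt(id_vars="Country/Region", var_name="Date", value_name="conf_cases")
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")
long_df = long_df.sort_values(["Country/Region", "Date"]).reset_index(drop=True)

# Day-over-day change within each country. The first day has no predecessor,
# so its change is taken to be the cumulative count itself.
long_df["conf_change"] = (
    long_df.groupby("Country/Region")["conf_cases"]
    .diff()
    .fillna(long_df["conf_cases"])
    .astype(int)
)
print(long_df)
```

Differencing inside each group matters here: a plain shift over the whole column would subtract the last count of country A from the first count of country B.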
Below is the global death data. It includes every country, its latitude and longitude, and the cumulative number of deaths on each day.
world_death = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", sep=',')
world_death.head()
We did the same for the death data: we reshaped it to long form and calculated the daily increase in deaths.
# Reshape from wide (one column per date) to long (one row per region and date).
us_death = pd.melt(world_death, ['Province/State','Country/Region', 'Lat', 'Long'], var_name="Date", value_name='death_cases')
us_death = us_death.drop(columns=['Province/State', 'Lat', 'Long'])
us_death = us_death.rename(columns={'Country/Region': 'Country'})
us_death["Date"] = pd.to_datetime(us_death['Date'])
# Sum provinces so each country has one cumulative count per date.
us_death = us_death.groupby(['Country', 'Date']).sum()
# Difference within each country so one country's totals never leak into the next.
# The first date has no previous day, so its change is the cumulative count itself.
us_death["death_change"] = us_death.groupby(level='Country')['death_cases'].diff().fillna(us_death['death_cases'])
us_death = us_death.reset_index()
# Drop days where the cumulative count went down (a negative daily change).
us_death = us_death[us_death["death_change"] >= 0]
us_death = us_death[us_death["Country"]=="US"]
us_death = us_death.set_index("Date")
us_death = us_death.drop(columns=['Country'])
us_death.head()
We then joined the two tables into one data frame, us_overall. The new data frame holds cumulative confirmed cases, the daily change in confirmed cases, cumulative deaths, and the daily change in deaths.
us_overall = us_conf.join(us_death, how='outer')
us_overall.head()
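As a small illustration of what the outer join does, the sketch below joins two toy frames indexed by date. The values are invented. Because rows with a negative daily change were filtered out of each table separately, the two tables may not cover exactly the same dates, and an outer join keeps a date even when it appears in only one of them.

```python
import pandas as pd

# Two toy tables whose date indexes only partly overlap; values are invented.
dates_a = pd.to_datetime(["2020-01-22", "2020-01-23"])
dates_b = pd.to_datetime(["2020-01-23", "2020-01-24"])
cases = pd.DataFrame({"conf_cases": [1, 3]}, index=dates_a)
deaths = pd.DataFrame({"death_cases": [0, 1]}, index=dates_b)

# how="outer" keeps the union of the two indexes; missing cells become NaN.
joined = cases.join(deaths, how="outer")
print(joined)
```

The NaN cells mark dates that were present in one table only, which is worth checking before plotting or modelling.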
1.2.2 US states
Below are the confirmed cases in each state of the United States. We also want to pick out states that are representative of a particular area, so we selected one state for each of 9 regions of the country.
We first read in the data from the Hopkins site.
conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv", sep=',')
conf.head()
We chose 9 states, one located in each of the 9 regions of the US.
# Maryland plus eight more states, one per region.
list1 = ["Maine", "New York", "Wisconsin", "Kansas", "Alabama", "Texas", "Arizona", "California"]
states = ["Maryland"] + list1
# County-level rows for the nine states; reused later for the map.
result = conf[conf["Province_State"].isin(states)]
# The first 11 columns are metadata; the rest hold one cumulative count per date.
date_cols = list(conf.columns[11:])
# Sum the counties within each state, order the states as listed, and put dates on the index.
confirmed = result.groupby("Province_State")[date_cols].sum().loc[states].T
confirmed.index = pd.to_datetime(confirmed.index)
confirmed.head()
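The per-state totals above come from summing county rows. A minimal sketch of that aggregation on a toy table follows; the county names and counts are invented, and only the column names are assumed to match the JHU US file.

```python
import pandas as pd

# Toy county-level table; county names and counts are invented.
counties = pd.DataFrame({
    "Admin2": ["X", "Y", "Z"],
    "Province_State": ["Maryland", "Maryland", "Maine"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [2, 4, 5],
})
date_cols = ["1/22/20", "1/23/20"]

# One row per state after summing its counties, then transpose so that
# dates form the index and states form the columns.
state_totals = counties.groupby("Province_State")[date_cols].sum().T
state_totals.index = pd.to_datetime(state_totals.index, format="%m/%d/%y")
print(state_totals)
```

Summing once per state avoids counting a total row twice, which can happen when a sum row is appended to the county rows and the frame is summed again.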
Number of deaths by state in the US
us_dead = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv", sep=',')
us_dead.head()
The total number of deaths in the 9 states.
# The same nine states; list1 holds the eight states other than Maryland.
states = ["Maryland"] + list1
# The deaths file has 12 metadata columns (it adds Population), so dates start at column 12.
death_cols = list(us_dead.columns[12:])
death = us_dead[us_dead["Province_State"].isin(states)]
# Sum the counties within each state, order the states as listed, and put dates on the index.
death = death.groupby("Province_State")[death_cols].sum().loc[states].T
death.index = pd.to_datetime(death.index)
death.head()
us_overall.plot(y="conf_cases", legend=None)
us_overall.plot(y="conf_change")
2.1.2 Deaths trend in the US
us_overall.plot(y="death_cases", legend=None)
us_overall.plot(y="death_change")
# , figsize=(15,10)
confirmed.plot()
Cumulative cases in all nine states continue to trend upward. The number of confirmed cases appears to increase sharply around June 2022. One possible reason is that people travel during the summer vacation, which increases the chance of contact.
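One way to check an observation like this against the data is to convert a cumulative series into monthly new cases and see which months stand out. The sketch below does this on an invented series; `cumulative` is a hypothetical stand-in for one state's column of `confirmed`, and the numbers are made up so that June is the largest month by construction.

```python
import pandas as pd

# Invented daily new cases: 10 per day, except 50 per day in June.
idx = pd.date_range("2022-04-01", "2022-07-31", freq="D")
daily = pd.Series(10, index=idx)
daily[idx.month == 6] = 50

# Cumulative counts, the form in which the JHU files store the data.
cumulative = daily.cumsum()

# Recover daily new cases from the cumulative series, then total them by month.
new_cases = cumulative.diff().fillna(cumulative.iloc[0])
monthly = new_cases.groupby(new_cases.index.to_period("M")).sum()
print(monthly)
print("Largest month:", monthly.idxmax())
```

Applied to the real columns, a table like `monthly` shows directly whether a jump in the cumulative curve corresponds to an unusually large month.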
death.plot()
We then treated the 9 selected states the same way as the global confirmed and death data: we reshaped the data frame to long form and calculated the daily increase in confirmed cases for each county in the nine states.
# Keep only the county name, state, coordinates, and the date columns.
result = result.drop(result.columns[[0,1,2,3,4,7,10]], axis=1)
result = pd.melt(result, ['Admin2','Province_State', 'Lat', 'Long_'], var_name="Date", value_name='Cases')
result = result.rename(columns={'Admin2': 'Admin', 'Long_': 'Long'})
result["Date"] = pd.to_datetime(result['Date'])
# Sort so each county's dates are consecutive, then difference within each county.
# Grouping by state as well as county name keeps same-named counties in different states apart.
result = result.sort_values(['Province_State', 'Admin', 'Date'])
result["Daily_change"] = result.groupby(['Province_State', 'Admin'])['Cases'].diff().fillna(result['Cases'])
result = result.drop(columns=['Province_State'])
result = result[result["Daily_change"] >= 0]
Afterwards, we plot them on a map of the US, representing the daily increase in each county as a bubble sized by the number of new cases and animated by date.
result["Date"] = result["Date"].astype(str)
fig = px.scatter_geo(result, lat="Lat", lon="Long",
                     hover_name="Admin", size="Daily_change", size_max=80,
                     animation_frame="Date",
                     scope="usa",
                     title="Daily New Cases")
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 100
fig.show()